OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Improved CHAID Algorithm for Document Structure Modelling

Internal identifier: 000776 (Main/Exploration); previous: 000775; next: 000777

Authors: Abdel Belaïd [France]; T. Moinel [France]; Y. Rangoni [France]

Source: Proceedings of SPIE, the International Society for Optical Engineering (ISSN 0277-786X), 2010

RBID : Pascal:10-0429692

Abstract

This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.
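
The chi-squared test at the core of CHAID-style tree induction can be illustrated with a minimal Python sketch. It is not the authors' I-CHAID implementation (which the abstract attributes to SIPINA): it only ranks candidate features by their Pearson chi-squared statistic against the logical labels and omits the category merging and adjusted significance testing that CHAID also performs. The feature names (`font`, `position`), the label names and the toy blocks are hypothetical.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic of the feature-by-label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


def best_split(blocks, labels, features):
    """Return the feature with the largest chi-squared statistic, plus all scores."""
    scores = {
        name: chi_squared([block[name] for block in blocks], labels)
        for name in features
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: six text blocks with two categorical physical features.
blocks = [
    {"font": "large", "position": "top"},
    {"font": "large", "position": "top"},
    {"font": "small", "position": "top"},
    {"font": "small", "position": "bottom"},
    {"font": "small", "position": "top"},
    {"font": "small", "position": "bottom"},
]
labels = ["title", "title", "body", "body", "body", "body"]

feature, scores = best_split(blocks, labels, ["font", "position"])
print(feature, scores)
```

In this toy data the `font` feature separates the two labels perfectly, so it would be chosen for the first split; a fuller implementation would compare p-values rather than raw statistics, since features with more categories have more degrees of freedom.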



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Improved CHAID Algorithm for Document Structure Modelling</title>
<author>
<name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">Abdel Belaïd</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Alsace-Champagne-Ardenne-Lorraine</region>
<region type="region" nuts="2">Région Lorraine</region>
</placeName>
<orgName type="laboratoire" n="5">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="institution">Centre national de la recherche scientifique</orgName>
<orgName type="institution">Institut national de recherche en informatique et en automatique</orgName>
</affiliation>
</author>
<author>
<name sortKey="Moinel, T" sort="Moinel, T" uniqKey="Moinel T" first="T." last="Moinel">T. Moinel</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Rangoni, Y" sort="Rangoni, Y" uniqKey="Rangoni Y" first="Y." last="Rangoni">Y. Rangoni</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">10-0429692</idno>
<date when="2010">2010</date>
<idno type="stanalyst">PASCAL 10-0429692 INIST</idno>
<idno type="RBID">Pascal:10-0429692</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000164</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000613</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000150</idno>
<idno type="wicri:doubleKey">0277-786X:2010:Belaid A:improved:chaid:algorithm</idno>
<idno type="wicri:Area/Main/Merge">000781</idno>
<idno type="wicri:Area/Main/Curation">000776</idno>
<idno type="wicri:Area/Main/Exploration">000776</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Improved CHAID Algorithm for Document Structure Modelling</title>
<author>
<name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">Abdel Belaïd</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Alsace-Champagne-Ardenne-Lorraine</region>
<region type="region" nuts="2">Région Lorraine</region>
</placeName>
<orgName type="laboratoire" n="5">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="institution">Centre national de la recherche scientifique</orgName>
<orgName type="institution">Institut national de recherche en informatique et en automatique</orgName>
</affiliation>
</author>
<author>
<name sortKey="Moinel, T" sort="Moinel, T" uniqKey="Moinel T" first="T." last="Moinel">T. Moinel</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Rangoni, Y" sort="Rangoni, Y" uniqKey="Rangoni Y" first="Y." last="Rangoni">Y. Rangoni</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
<imprint>
<date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Decision trees</term>
<term>Document image processing</term>
<term>Document retrieval</term>
<term>Document structure</term>
<term>Labelling</term>
<term>Modelling</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>State of the art</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Algorithme</term>
<term>Reconnaissance forme</term>
<term>Recherche documentaire</term>
<term>Structure document</term>
<term>Modélisation</term>
<term>Etiquetage</term>
<term>Traitement image document</term>
<term>Arbre décision</term>
<term>Etat actuel</term>
<term>Reconnaissance optique caractère</term>
<term>0130C</term>
<term>4230S</term>
<term>4230V</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Alsace-Champagne-Ardenne-Lorraine</li>
<li>Région Lorraine</li>
</region>
<settlement>
<li>Nancy</li>
<li>Vandœuvre-lès-Nancy</li>
</settlement>
<orgName>
<li>Centre national de la recherche scientifique</li>
<li>Institut national de recherche en informatique et en automatique</li>
<li>Laboratoire lorrain de recherche en informatique et ses applications</li>
<li>Université de Lorraine</li>
</orgName>
</list>
<tree>
<country name="France">
<noRegion>
<name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">Abdel Belaïd</name>
</noRegion>
<name sortKey="Moinel, T" sort="Moinel, T" uniqKey="Moinel T" first="T." last="Moinel">T. Moinel</name>
<name sortKey="Rangoni, Y" sort="Rangoni, Y" uniqKey="Rangoni Y" first="Y." last="Rangoni">Y. Rangoni</name>
</country>
</tree>
</affiliations>
</record>
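
As a hedged illustration of how the record above might be read programmatically, the following Python sketch extracts the title and author names with the standard library. The record uses the `wicri:` and `inist:` prefixes without declaring them, which a namespace-aware parser is expected to reject, so the sketch binds them to placeholder URIs first. Those URIs (`urn:example:wicri`, `urn:example:inist`) and the helper name `parse_record` are assumptions made for this example only, and the embedded record is a shortened excerpt.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the record above; most elements and attributes are omitted.
RECORD = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Improved CHAID Algorithm for Document Structure Modelling</title>
<author>
<name sortKey="Belaid, A" first="A." last="Belaïd">Abdel Belaïd</name>
<affiliation wicri:level="1">
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Moinel, T" first="T." last="Moinel">T. Moinel</name>
<affiliation wicri:level="1">
<country>France</country>
</affiliation>
</author>
</titleStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

# Assumption: placeholder namespace URIs, because the record declares none.
NAMESPACES = 'xmlns:wicri="urn:example:wicri" xmlns:inist="urn:example:inist"'


def parse_record(text):
    """Bind the undeclared prefixes on the root element, then parse the record."""
    patched = text.replace("<record>", f"<record {NAMESPACES}>", 1)
    return ET.fromstring(patched)


root = parse_record(RECORD)
title_stmt = root.find("./TEI/teiHeader/fileDesc/titleStmt")
title = title_stmt.findtext("title")
authors = [name.text for name in title_stmt.findall("author/name")]
print(title)
print(authors)
```

Patching the root tag keeps the rest of the record untouched; the same approach should apply to the full record, provided its root element is written exactly as `<record>`.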

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000776 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000776 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:10-0429692
   |texte=   Improved CHAID Algorithm for Document Structure Modelling
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024